Oxford TRECVid 2007 – Notebook paper

Authors

  • James Philbin
  • Ondrej Chum
  • Josef Sivic
  • Vittorio Ferrari
  • Manuel J. Marín-Jiménez
  • Anna Bosch
  • Nicholas Apostoloff
  • Andrew Zisserman
Abstract

The Oxford team participated in the high-level feature extraction and interactive search tasks. A vision-only approach was used for both tasks, with no use of the text or audio information. For the high-level feature extraction task, we used two different approaches, both based on sparse visual features. One used a standard bag-of-words representation, while the other additionally used a lower-dimensional “topic”-based representation generated by Latent Dirichlet Allocation (LDA). For both methods, we trained χ²-based SVM classifiers for all high-level features using publicly available annotations [3]. In addition, for certain features, we took a more targeted approach. Features based on human actions, such as “Walking/Running” and “People Marching”, were answered by using a robust pedestrian detector on every frame, coupled with an action classifier targeted to each feature to give high-precision results. For “Face” and “Person”, we used a real-time face detector and pedestrian detector, and for “Car” and “Truck”, we used a classifier which localized the vehicle in each image, trained on an external set of images of side and front views. We submitted 6 different runs. OXVGG_1 (0.073 mAP) was our best run; it used a fusion of our LDA and bag-of-words results for most features, but favored our specific methods for the features where these were available. OXVGG_2 (0.062 mAP) and OXVGG_3 (0.060 mAP) were variations on this first run, using different parameter settings. OXVGG_4 (0.060 mAP) used LDA for all features and OXVGG_5 (0.059 mAP) used bag-of-words for all features. OXVGG_6 (0.066 mAP) was a variation of our first run. We came first in “Mountain” and were in the top five for “Studio”, “Car”, “Truck” and “Explosion/Fire”. Our main observation this year is that we can boost retrieval performance by using tailored approaches for specific concepts.

For the interactive search task, we coupled the results generated during the high-level task with methods to facilitate efficient and productive interactive search. Our system allowed for several “expansion” methods based on different image representations. The main differences between this year’s system and last year’s were the availability of many more expansion methods and a “temporal zoom” facility, which proved invaluable in answering the many action queries in this year’s task. We submitted just one run, I_C_2_VGG_I_1_1, which came second overall with an mAP of 0.328 and came first in 5 queries.

1 High-level Feature Extraction

For the high-level feature task, we used two generic methods, which were run for all topics, and more specialized methods for particular topics. These results were then fused to create the final submission.

1.1 Generic Approaches

For the following approaches, we used a reduced subset of MPEG i-frames from each shot, found by clustering the i-frames within a shot. Our approach was to train an SVM for the concept in question, then score all frames in the test set by their distance from the discriminating hyperplane. We then ranked the test shots by the maximum score over their reduced i-frames. We developed two methods for this task, differing only in their representations: the first uses a standard bag-of-words representation, while the second concatenates this bag-of-words representation with a topic-based LDA representation.
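To make the ranking step concrete, the sketch below scores every reduced i-frame with a trained classifier and orders shots by their best frame. It is a minimal illustration, not the code behind the runs; the scikit-learn-style classifier interface and the frame-to-shot index are assumptions for the example.

```python
# Minimal sketch of the generic ranking scheme: score each reduced i-frame
# by its signed distance from the SVM hyperplane, then rank shots by the
# maximum score over their frames. The classifier interface (scikit-learn
# style) and the frame-to-shot index are assumptions for illustration.
from collections import defaultdict

def rank_shots(clf, frame_features, frame_to_shot):
    """clf: trained SVM; frame_features: (n_frames, d) array;
    frame_to_shot: shot id for each frame (hypothetical layout)."""
    scores = clf.decision_function(frame_features)
    best = defaultdict(lambda: float("-inf"))
    for shot_id, score in zip(frame_to_shot, scores):
        best[shot_id] = max(best[shot_id], score)
    # Shots in decreasing order of their best frame score.
    return sorted(best, key=best.get, reverse=True)
```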
1.1.1 Bag of visual words representation

The first method uses a bag of (visual) words [29] representation for the frames, in which positional relationships between features are ignored. This representation has proved successful for classifying images according to whether they contain visual categories (such as cars, horses, etc.) by training an SVM [10]. Here we use the kernel formulation proposed by [33].

[Figure 1: An example of Hessian-Laplace regions used in the bag of words method. Left: original image; right: sparse detected regions overlaid as ellipses.]

Features and bag of words representation. We used Hessian-Laplace (HL) [21] interest points coupled with a SIFT [20] descriptor. This combination of detection and description generates features which are approximately invariant to an affine transformation of the image (see figure 1). These features are computed for all reduced i-frames. The “visual vocabulary” is then constructed by running unsupervised K-means clustering over both the training and test data; the K-means cluster centres define the visual words. We used a vocabulary size of K = 10,000 visual words. The SIFT features in each reduced i-frame are then assigned to the nearest cluster centre to give the visual word representation, and the number of occurrences of each visual word is recorded in a histogram. This histogram of visual words is the bag-of-visual-words model for that frame.

Topic-based representation. We use the Latent Dirichlet Allocation [5, 16] model to obtain a low-dimensional representation of the bag-of-visual-words feature vectors. Similar low-dimensional representations have been found useful in the context of unsupervised [26, 28] and supervised [6, 25] object and scene category recognition, and in image retrieval [17, 27]. We pool together both TRECVid training and test data in the form of 10,000-dimensional bag-of-visual-words vectors and learn 20, 50, 100, 500 and 1,000 topic models. The models are fitted using the Gibbs sampler described in [16]. These representations are concatenated into a single feature vector, each one independently normalized, such that the bag-of-words and the individual topic representations are each given equal weight. This approach was found to work best using a validation set taken from the training data.

SVM classification. To predict whether a keyframe from the test set belongs to a concept, an SVM classifier is trained for each concept. Specifically, we use a kernel SVM with the χ² kernel

K(p, q) = exp(−α χ²(p, q)),

where χ²(p, q) is the χ² distance between histograms p and q and α is a scaling parameter.
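As an illustration of the vocabulary construction, the sketch below clusters pooled SIFT descriptors and builds per-frame word histograms. It is a sketch under stated assumptions: scikit-learn's MiniBatchKMeans is a scalable stand-in for the K-means implementation actually used, and descriptor extraction (Hessian-Laplace + SIFT) is assumed to have happened upstream.

```python
# Sketch of visual-vocabulary construction and per-frame BoW histograms
# (a stand-in for the actual pipeline, using scikit-learn's K-means).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 10000  # vocabulary size used in the paper

def build_vocabulary(all_descriptors):
    """all_descriptors: (n, 128) SIFT descriptors pooled over the corpus."""
    # The K-means cluster centres define the visual words; MiniBatchKMeans
    # keeps clustering tractable at this vocabulary size.
    return MiniBatchKMeans(n_clusters=K, batch_size=50000).fit(all_descriptors)

def bow_histogram(vocab, frame_descriptors):
    """Nearest-centre assignment and per-word occurrence counts for one frame."""
    words = vocab.predict(frame_descriptors)
    return np.bincount(words, minlength=K)
```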
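The concatenated representation might be assembled along the following lines. Again a sketch under stated assumptions: scikit-learn's variational LDA stands in for the Gibbs sampler of [16], and per-block L2 normalization is one plausible reading of "each one independently normalized".

```python
# Sketch of the concatenated BoW + multi-size LDA topic representation.
# sklearn's variational LDA stands in for the Gibbs sampler of [16];
# the L2 normalization per block is an assumption.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def l2_normalize_rows(x):
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def topic_augmented_features(bow_counts, topic_sizes=(20, 50, 100, 500, 1000)):
    """bow_counts: (n_frames, 10000) bag-of-visual-words count matrix."""
    blocks = [l2_normalize_rows(bow_counts.astype(float))]
    for t in topic_sizes:
        lda = LatentDirichletAllocation(n_components=t, learning_method="online")
        theta = lda.fit_transform(bow_counts)  # per-frame topic proportions
        blocks.append(l2_normalize_rows(theta))
    # Independent per-block normalization gives the BoW histogram and each
    # topic model equal weight in the concatenated vector.
    return np.hstack(blocks)
```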
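The per-concept classifier can be reproduced with a precomputed kernel. A minimal sketch, assuming scikit-learn and setting α to the inverse mean χ² distance between training histograms, a common heuristic for this kernel; the choice of α is not stated in the excerpt above.

```python
# Sketch of the per-concept chi-squared kernel SVM via a precomputed kernel.
# Setting alpha to the inverse mean chi^2 distance is an assumed heuristic.
import numpy as np
from sklearn.metrics.pairwise import additive_chi2_kernel
from sklearn.svm import SVC

def train_concept_svm(X_train, y_train):
    """X_train: non-negative feature histograms; y_train: binary concept labels."""
    D = -additive_chi2_kernel(X_train)  # pairwise chi^2 distances (>= 0)
    alpha = 1.0 / D.mean()              # assumed choice of the scale alpha
    K = np.exp(-alpha * D)              # K(p, q) = exp(-alpha * chi^2(p, q))
    clf = SVC(kernel="precomputed").fit(K, y_train)
    return clf, alpha

def frame_scores(clf, alpha, X_train, X_test):
    # Kernel between test and training frames, then signed distance from
    # the discriminating hyperplane for each test frame.
    K_test = np.exp(alpha * additive_chi2_kernel(X_test, X_train))
    return clf.decision_function(K_test)
```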

Publication date: 2007